Feature Effects Empirical Analysis - Preliminary Results 2024/04/25

Simulation Configuration

Simulation parameters:

  • Number of training samples: 1000
  • Models:
    • Random Forest
    • XGBoost
    • Decision Tree
    • SVM (RBF-Kernel)
    • ElasticNet
    • GAM (correctly specified)
  • Noise parameters (sd): 0.1, 0.5
  • Number of simulations per configuration: 5
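
A minimal sketch of how the configurations listed above could be generated. The data-generating function is not shown in this notebook, so `groundtruth` below is a hypothetical stand-in, and uniform features on [0, 1] as well as the seeding scheme are assumptions made only for illustration.

```python
import numpy as np

def groundtruth(X):
    # Hypothetical groundtruth function (an assumption, not taken from the notebook).
    return (10 * np.sin(np.pi * X[:, 0] * X[:, 1])
            + 20 * (X[:, 2] - 0.5) ** 2
            + 10 * X[:, 3]
            + 5 * X[:, 4])

def simulate(n_samples, noise_sd, seed):
    # Draw features, evaluate the groundtruth, and add Gaussian noise with the given sd.
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n_samples, 5))
    y = groundtruth(X) + rng.normal(0.0, noise_sd, size=n_samples)
    return X, y

# One dataset per (simulation, noise_sd) pair: 5 simulations x 2 noise levels.
datasets = {}
for noise_sd in (0.1, 0.5):
    for sim in range(1, 6):
        # Using the simulation number as seed is an illustrative choice.
        datasets[(sim, noise_sd)] = simulate(1000, noise_sd, seed=sim)
```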

Additional settings:

  • Number of test samples: 1000
  • Number of tuning trials: 500
  • Number of CV folds for trial evaluation: 5
  • Tuning metric (maximize): neg_mean_squared_error
  • groundtruth_feature_effects: theoretical
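
The tuning setup above (trials scored by 5-fold CV, maximizing the negated MSE) can be sketched as follows. The tuning library used in the notebook is not stated, so the random search, the ridge model, and all names below are assumptions chosen to keep the example dependency-free.

```python
import numpy as np

def neg_mse_cv(alpha, X, y, n_folds=5, seed=0):
    """Mean negated MSE of a ridge fit over n_folds CV folds (higher is better)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    scores = []
    for k in range(n_folds):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        # Closed-form ridge solution on the training folds (no intercept, for brevity).
        A = X[train].T @ X[train] + alpha * np.eye(X.shape[1])
        coef = np.linalg.solve(A, X[train].T @ y[train])
        mse = np.mean((y[val] - X[val] @ coef) ** 2)
        scores.append(-mse)  # negate so that "maximize" means "minimize MSE"
    return float(np.mean(scores))

# Hypothetical toy data, used only to make the sketch runnable.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0.0, 0.1, size=200)

# Random search: each trial samples one candidate and is scored by 5-fold CV.
n_trials = 20  # the notebook uses 500 trials; reduced here to keep the sketch fast
candidates = 10 ** rng.uniform(-4, 2, size=n_trials)
scores = [neg_mse_cv(a, X, y) for a in candidates]
best_alpha = candidates[int(np.argmax(scores))]
```

Because the score is a negated error, every value is at most zero and the best trial is the one closest to zero.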

Settings for PDP:

  • Grid resolution: 100 (equidistant)
  • percentiles: (0,1)
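
For illustration, a partial dependence estimate with these settings can be sketched by hand. This is not the notebook's implementation: `partial_dependence_curve` and `toy_predict` are hypothetical names, and `percentiles=(0, 1)` is interpreted here as spanning the grid from the observed minimum to the observed maximum of the feature.

```python
import numpy as np

def partial_dependence_curve(predict, X, feature, grid_resolution=100,
                             percentiles=(0.0, 1.0)):
    """Estimate the PD curve of `predict` for one feature on an equidistant grid."""
    lo, hi = np.quantile(X[:, feature], percentiles)
    grid = np.linspace(lo, hi, grid_resolution)
    pd_values = np.empty(grid_resolution)
    for i, value in enumerate(grid):
        X_mod = X.copy()
        X_mod[:, feature] = value              # fix the feature at the grid value
        pd_values[i] = predict(X_mod).mean()   # average over the observed other features
    return grid, pd_values

# Hypothetical toy model, used only to make the sketch runnable.
def toy_predict(X):
    return 3.0 * X[:, 0] + X[:, 1] ** 2

rng = np.random.default_rng(0)
X_train = rng.uniform(0.0, 1.0, size=(1000, 2))
grid, pd_x0 = partial_dependence_curve(toy_predict, X_train, feature=0)
```

For this additive toy model the PD curve of the first feature is its own term plus the sample mean of the second term, which makes the result easy to verify.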

Settings for ALE:

  • Grid intervals: 99
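
Note that 99 intervals correspond to 100 interval edges, which matches the PDP grid resolution. The settings do not say whether the ALE intervals are equidistant or quantile-based, so the sketch below shows both options; `interval_edges` is a hypothetical helper and the normally distributed feature is an assumption chosen to make the difference visible.

```python
import numpy as np

def interval_edges(x, n_intervals=99, kind="quantile"):
    """Return n_intervals + 1 interval edges for a 1-D feature sample."""
    if kind == "quantile":
        # Roughly equal numbers of observations per interval.
        return np.quantile(x, np.linspace(0.0, 1.0, n_intervals + 1))
    if kind == "equidistant":
        # Equal interval widths between the observed minimum and maximum.
        return np.linspace(x.min(), x.max(), n_intervals + 1)
    raise ValueError(f"unknown kind: {kind}")

rng = np.random.default_rng(0)
x = rng.normal(size=1000)  # hypothetical, non-uniform feature

edges_q = interval_edges(x, kind="quantile")
edges_e = interval_edges(x, kind="equidistant")

# Number of observations falling into each interval.
counts_q, _ = np.histogram(x, bins=edges_q)
counts_e, _ = np.histogram(x, bins=edges_e)
```

With quantile edges every interval holds about the same number of observations, whereas equidistant edges leave the tails of a non-uniform feature sparsely populated.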

Model Results

In [19]:
model_results_storage = config.get("storage", "model_results")
df = pd.read_sql_table("model_results", f"sqlite:///{base_path + model_results_storage}")
df
Out[19]:
(row) index model_id model simulation n_train noise_sd mse_train mse_test mae_train mae_test r2_train r2_test
0 0 RandomForestRegressor_20240420_1_1000_0.1 RandomForestRegressor 1 1000 0.1 1.033248 2.564450 0.797747 1.262403 0.958388 0.895720
1 0 XGBRegressor_20240420_1_1000_0.1 XGBRegressor 1 1000 0.1 0.000746 0.293658 0.019431 0.406508 0.999970 0.988059
2 0 DecisionTreeRegressor_20240420_1_1000_0.1 DecisionTreeRegressor 1 1000 0.1 1.592632 5.658174 0.985326 1.879092 0.935860 0.769917
3 0 SVR_20240420_1_1000_0.1 SVR 1 1000 0.1 0.009583 0.018337 0.081111 0.103061 0.999614 0.999254
4 0 ElasticNet_20240420_1_1000_0.1 ElasticNet 1 1000 0.1 5.716357 6.326010 1.822135 1.928950 0.769785 0.742760
5 0 GAM_20240420_1_1000_0.1 GAM 1 1000 0.1 0.012509 0.014902 0.086171 0.093055 0.999496 0.999394
6 0 RandomForestRegressor_20240420_1_1000_0.5 RandomForestRegressor 1 1000 0.5 1.164493 2.879359 0.844467 1.334659 0.953601 0.884573
7 0 XGBRegressor_20240420_1_1000_0.5 XGBRegressor 1 1000 0.5 0.017015 0.589488 0.100901 0.600300 0.999322 0.976369
8 0 DecisionTreeRegressor_20240420_1_1000_0.5 DecisionTreeRegressor 1 1000 0.5 1.291587 5.835857 0.864694 1.899462 0.948537 0.766054
9 0 SVR_20240420_1_1000_0.5 SVR 1 1000 0.5 0.217337 0.292701 0.353357 0.434191 0.991340 0.988266
10 0 ElasticNet_20240420_1_1000_0.5 ElasticNet 1 1000 0.5 6.001145 6.663460 1.875507 1.974421 0.760886 0.732877
11 0 GAM_20240420_1_1000_0.5 GAM 1 1000 0.5 0.231005 0.252242 0.381962 0.404557 0.990796 0.989888
12 0 RandomForestRegressor_20240420_2_1000_0.1 RandomForestRegressor 2 1000 0.1 1.000421 2.520993 0.789242 1.251487 0.956347 0.889419
13 0 XGBRegressor_20240420_2_1000_0.1 XGBRegressor 2 1000 0.1 0.000968 0.328785 0.023671 0.416093 0.999958 0.985578
14 0 DecisionTreeRegressor_20240420_2_1000_0.1 DecisionTreeRegressor 2 1000 0.1 1.199006 4.655333 0.846119 1.712575 0.947682 0.795799
15 0 SVR_20240420_2_1000_0.1 SVR 2 1000 0.1 0.009388 0.015729 0.080623 0.097832 0.999590 0.999310
16 0 ElasticNet_20240420_2_1000_0.1 ElasticNet 2 1000 0.1 5.729666 5.603485 1.818694 1.835509 0.749988 0.754209
17 0 GAM_20240420_2_1000_0.1 GAM 2 1000 0.1 0.012891 0.015190 0.087164 0.095487 0.999438 0.999334
18 0 RandomForestRegressor_20240420_2_1000_0.5 RandomForestRegressor 2 1000 0.5 1.095368 2.848674 0.828985 1.325634 0.952306 0.877309
19 0 XGBRegressor_20240420_2_1000_0.5 XGBRegressor 2 1000 0.5 0.027323 0.604914 0.127741 0.597023 0.998810 0.973947
20 0 DecisionTreeRegressor_20240421_2_1000_0.5 DecisionTreeRegressor 2 1000 0.5 2.082978 5.115531 1.116098 1.780645 0.909304 0.779676
21 0 SVR_20240421_2_1000_0.5 SVR 2 1000 0.5 0.218487 0.303596 0.352180 0.435328 0.990487 0.986924
22 0 ElasticNet_20240421_2_1000_0.5 ElasticNet 2 1000 0.5 5.951501 5.899967 1.863729 1.878424 0.740863 0.745891
23 0 GAM_20240421_2_1000_0.5 GAM 2 1000 0.5 0.237038 0.264995 0.390000 0.410258 0.989679 0.988587
24 0 RandomForestRegressor_20240421_3_1000_0.1 RandomForestRegressor 3 1000 0.1 0.958338 2.754806 0.772361 1.314394 0.957607 0.887096
25 0 XGBRegressor_20240421_3_1000_0.1 XGBRegressor 3 1000 0.1 0.002674 0.261840 0.039827 0.369762 0.999882 0.989269
26 0 DecisionTreeRegressor_20240421_3_1000_0.1 DecisionTreeRegressor 3 1000 0.1 1.574273 5.567450 0.982717 1.861938 0.930360 0.771821
27 0 SVR_20240421_3_1000_0.1 SVR 3 1000 0.1 0.010235 0.016686 0.083425 0.098279 0.999547 0.999316
28 0 ElasticNet_20240421_3_1000_0.1 ElasticNet 3 1000 0.1 5.574418 6.058893 1.836146 1.937617 0.753410 0.751680
29 0 GAM_20240421_3_1000_0.1 GAM 3 1000 0.1 0.012934 0.013794 0.088165 0.088148 0.999428 0.999435
30 0 RandomForestRegressor_20240421_3_1000_0.5 RandomForestRegressor 3 1000 0.5 1.061379 2.991197 0.818946 1.358599 0.953455 0.877839
31 0 XGBRegressor_20240421_3_1000_0.5 XGBRegressor 3 1000 0.5 0.017289 0.556007 0.101409 0.577047 0.999242 0.977293
32 0 DecisionTreeRegressor_20240421_3_1000_0.5 DecisionTreeRegressor 3 1000 0.5 1.268283 5.774608 0.875044 1.900316 0.944382 0.764163
33 0 SVR_20240421_3_1000_0.5 SVR 3 1000 0.5 0.234867 0.293101 0.365515 0.431716 0.989700 0.988030
34 0 ElasticNet_20240421_3_1000_0.5 ElasticNet 3 1000 0.5 5.755439 6.248423 1.861620 1.969459 0.747606 0.744813
35 0 GAM_20240421_3_1000_0.5 GAM 3 1000 0.5 0.238331 0.243536 0.386625 0.390700 0.989548 0.990054
36 0 RandomForestRegressor_20240421_4_1000_0.1 RandomForestRegressor 4 1000 0.1 1.027261 2.857581 0.790193 1.343309 0.952974 0.884128
37 0 XGBRegressor_20240421_4_1000_0.1 XGBRegressor 4 1000 0.1 0.001324 0.319565 0.027762 0.433982 0.999939 0.987042
38 0 DecisionTreeRegressor_20240421_4_1000_0.1 DecisionTreeRegressor 4 1000 0.1 1.331648 5.091770 0.859009 1.798283 0.939039 0.793534
39 0 SVR_20240421_4_1000_0.1 SVR 4 1000 0.1 0.009024 0.017813 0.078977 0.104977 0.999587 0.999278
40 0 ElasticNet_20240421_4_1000_0.1 ElasticNet 4 1000 0.1 5.850776 6.116062 1.843441 1.925554 0.732162 0.752000
41 0 GAM_20240421_4_1000_0.1 GAM 4 1000 0.1 0.012342 0.014193 0.086584 0.092846 0.999435 0.999424
42 0 RandomForestRegressor_20240421_4_1000_0.5 RandomForestRegressor 4 1000 0.5 1.069152 3.259622 0.804937 1.426913 0.951168 0.869489
43 0 XGBRegressor_20240421_4_1000_0.5 XGBRegressor 4 1000 0.5 0.025635 0.627217 0.121261 0.624414 0.998829 0.974887
44 0 DecisionTreeRegressor_20240421_4_1000_0.5 DecisionTreeRegressor 4 1000 0.5 1.310757 5.717296 0.862180 1.911294 0.940133 0.771086
45 0 SVR_20240421_4_1000_0.5 SVR 4 1000 0.5 0.219180 0.315867 0.354312 0.446170 0.989989 0.987353
46 0 ElasticNet_20240421_4_1000_0.5 ElasticNet 4 1000 0.5 6.056966 6.462666 1.886121 1.988363 0.723355 0.741243
47 0 GAM_20240421_4_1000_0.5 GAM 4 1000 0.5 0.234881 0.265091 0.382909 0.408994 0.989272 0.989386
48 0 RandomForestRegressor_20240421_5_1000_0.1 RandomForestRegressor 5 1000 0.1 1.114586 2.827651 0.830868 1.313985 0.952512 0.886692
49 0 XGBRegressor_20240421_5_1000_0.1 XGBRegressor 5 1000 0.1 0.000624 0.304266 0.019096 0.421157 0.999973 0.987808
50 0 DecisionTreeRegressor_20240421_5_1000_0.1 DecisionTreeRegressor 5 1000 0.1 1.810721 5.400536 1.041824 1.820424 0.922852 0.783593
51 0 SVR_20240421_5_1000_0.1 SVR 5 1000 0.1 0.009398 0.016618 0.079992 0.102861 0.999600 0.999334
52 0 ElasticNet_20240421_5_1000_0.1 ElasticNet 5 1000 0.1 5.728092 6.305689 1.851228 1.990765 0.755947 0.747322
53 0 GAM_20240421_5_1000_0.1 GAM 5 1000 0.1 0.012423 0.015404 0.084889 0.096822 0.999471 0.999383
54 0 RandomForestRegressor_20240421_5_1000_0.5 RandomForestRegressor 5 1000 0.5 1.188135 3.030339 0.857844 1.361610 0.950160 0.880275
55 0 XGBRegressor_20240421_5_1000_0.5 XGBRegressor 5 1000 0.5 0.020554 0.600762 0.110879 0.616600 0.999138 0.976265
56 0 DecisionTreeRegressor_20240421_5_1000_0.5 DecisionTreeRegressor 5 1000 0.5 1.465710 5.821884 0.937452 1.898664 0.938516 0.769985
57 0 SVR_20240421_5_1000_0.5 SVR 5 1000 0.5 0.229592 0.318557 0.358809 0.454377 0.990369 0.987414
58 0 ElasticNet_20240421_5_1000_0.5 ElasticNet 5 1000 0.5 6.028785 6.550076 1.905356 2.026085 0.747104 0.741215
59 0 GAM_20240421_5_1000_0.5 GAM 5 1000 0.5 0.232448 0.292752 0.376785 0.431478 0.990249 0.988434
In [20]:
%matplotlib inline
boxplot_model_results(metric='mse', df=df);
In [21]:
%matplotlib inline
boxplot_model_results(metric='mae', df=df);
In [22]:
%matplotlib inline
boxplot_model_results(metric='r2', df=df);
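  • Best models by test error: GAM, SVM and XGBoost.
  • For SVM, GAM and ElasticNet, test errors are close to the training errors (acceptable generalization); the tree-based models (Random Forest, XGBoost, Decision Tree) show a clearly larger gap between training and test error.
  • As expected, test errors are smaller for the lower noise level in the generated data.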
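
As a rough check of these observations, the ratio of test to training MSE can be computed per model. The sketch below uses only the six rows of simulation 1 with noise_sd = 0.1 from the table above; `df_sub` and `mse_ratio` are names introduced here for illustration.

```python
import pandas as pd

# Rows copied from the table above (simulation 1, noise_sd = 0.1).
df_sub = pd.DataFrame(
    {
        "model": ["RandomForestRegressor", "XGBRegressor", "DecisionTreeRegressor",
                  "SVR", "ElasticNet", "GAM"],
        "mse_train": [1.033248, 0.000746, 1.592632, 0.009583, 5.716357, 0.012509],
        "mse_test": [2.564450, 0.293658, 5.658174, 0.018337, 6.326010, 0.014902],
    }
)

# Ratio of test to train MSE: values near 1 indicate a small generalization gap.
df_sub["mse_ratio"] = df_sub["mse_test"] / df_sub["mse_train"]

# Rank the models by test error (smallest first).
ranked = df_sub.sort_values("mse_test").reset_index(drop=True)
print(ranked)
```

On the full results table, the same idea could be applied per model and noise level, for example with `df.groupby(["model", "noise_sd"])[["mse_train", "mse_test"]].mean()`.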

PDP Results

Error of Model PD compared to Groundtruth PD

In [27]:
effects_results_storage = config.get("storage", "effects_results")
df = pd.read_sql_table("pdp_results", f"sqlite:///{base_path + effects_results_storage}")
df
Out[27]:
(row) index model_id model simulation n_train noise_sd metric x_1 x_2 x_3 x_4 x_5
0 0 RandomForestRegressor_20240420_1_1000_0.1 RandomForestRegressor 1 1000 0.1 mean_squared_error 0.180076 0.217016 0.670829 0.110735 0.089038
1 0 XGBRegressor_20240420_1_1000_0.1 XGBRegressor 1 1000 0.1 mean_squared_error 0.033126 0.037456 0.026116 0.017440 0.006373
2 0 DecisionTreeRegressor_20240420_1_1000_0.1 DecisionTreeRegressor 1 1000 0.1 mean_squared_error 0.196085 0.198124 0.788986 0.159278 0.092678
3 0 SVR_20240420_1_1000_0.1 SVR 1 1000 0.1 mean_squared_error 0.015542 0.023117 0.007918 0.000040 0.000365
4 0 ElasticNet_20240420_1_1000_0.1 ElasticNet 1 1000 0.1 mean_squared_error 0.972314 1.002115 2.296801 0.000887 0.001069
5 0 GAM_20240420_1_1000_0.1 GAM 1 1000 0.1 mean_squared_error 0.016391 0.021993 0.007518 0.000297 0.000313
6 0 RandomForestRegressor_20240420_1_1000_0.5 RandomForestRegressor 1 1000 0.5 mean_squared_error 0.183664 0.219501 0.639930 0.101102 0.095004
7 0 XGBRegressor_20240420_1_1000_0.5 XGBRegressor 1 1000 0.5 mean_squared_error 0.038568 0.042233 0.031817 0.021237 0.014175
8 0 DecisionTreeRegressor_20240420_1_1000_0.5 DecisionTreeRegressor 1 1000 0.5 mean_squared_error 0.151053 0.309740 0.646266 0.134242 0.128095
9 0 SVR_20240420_1_1000_0.5 SVR 1 1000 0.5 mean_squared_error 0.014997 0.019612 0.006570 0.000574 0.001520
10 0 ElasticNet_20240420_1_1000_0.5 ElasticNet 1 1000 0.5 mean_squared_error 0.969299 0.996363 2.295735 0.001265 0.000921
11 0 GAM_20240420_1_1000_0.5 GAM 1 1000 0.5 mean_squared_error 0.015177 0.017726 0.006334 0.000574 0.000415
12 0 RandomForestRegressor_20240420_2_1000_0.1 RandomForestRegressor 2 1000 0.1 mean_squared_error 0.161098 0.196815 0.654365 0.292932 0.106797
13 0 XGBRegressor_20240420_2_1000_0.1 XGBRegressor 2 1000 0.1 mean_squared_error 0.032194 0.039635 0.053917 0.072953 0.011180
14 0 DecisionTreeRegressor_20240420_2_1000_0.1 DecisionTreeRegressor 2 1000 0.1 mean_squared_error 0.117335 0.231798 0.756830 0.298415 0.139638
15 0 SVR_20240420_2_1000_0.1 SVR 2 1000 0.1 mean_squared_error 0.012230 0.022169 0.031324 0.051075 0.000940
16 0 ElasticNet_20240420_2_1000_0.1 ElasticNet 2 1000 0.1 mean_squared_error 1.010037 0.979492 2.313948 0.051719 0.001802
17 0 GAM_20240420_2_1000_0.1 GAM 2 1000 0.1 mean_squared_error 0.012763 0.022070 0.031080 0.051827 0.000964
18 0 RandomForestRegressor_20240420_2_1000_0.5 RandomForestRegressor 2 1000 0.5 mean_squared_error 0.163878 0.199857 0.635150 0.298019 0.112221
19 0 XGBRegressor_20240420_2_1000_0.5 XGBRegressor 2 1000 0.5 mean_squared_error 0.030827 0.047840 0.062385 0.080588 0.016794
20 0 DecisionTreeRegressor_20240421_2_1000_0.5 DecisionTreeRegressor 2 1000 0.5 mean_squared_error 0.167582 0.186047 0.948198 0.287861 0.164151
21 0 SVR_20240421_2_1000_0.5 SVR 2 1000 0.5 mean_squared_error 0.008515 0.019569 0.024629 0.041883 0.003447
22 0 ElasticNet_20240421_2_1000_0.5 ElasticNet 2 1000 0.5 mean_squared_error 1.009182 0.979291 2.315197 0.054096 0.002147
23 0 GAM_20240421_2_1000_0.5 GAM 2 1000 0.5 mean_squared_error 0.012904 0.023909 0.032837 0.054312 0.000924
24 0 RandomForestRegressor_20240421_3_1000_0.1 RandomForestRegressor 3 1000 0.1 mean_squared_error 0.245594 0.240970 1.021693 0.211167 0.200089
25 0 XGBRegressor_20240421_3_1000_0.1 XGBRegressor 3 1000 0.1 mean_squared_error 0.097366 0.069190 0.130603 0.053838 0.074659
26 0 DecisionTreeRegressor_20240421_3_1000_0.1 DecisionTreeRegressor 3 1000 0.1 mean_squared_error 0.276601 0.447049 0.870213 0.292748 0.185197
27 0 SVR_20240421_3_1000_0.1 SVR 3 1000 0.1 mean_squared_error 0.086705 0.056216 0.113306 0.047243 0.067375
28 0 ElasticNet_20240421_3_1000_0.1 ElasticNet 3 1000 0.1 mean_squared_error 1.055142 1.059413 2.433883 0.050307 0.072384
29 0 GAM_20240421_3_1000_0.1 GAM 3 1000 0.1 mean_squared_error 0.085922 0.055878 0.111484 0.047625 0.068099
30 0 RandomForestRegressor_20240421_3_1000_0.5 RandomForestRegressor 3 1000 0.5 mean_squared_error 0.241821 0.239866 1.032554 0.208625 0.202284
31 0 XGBRegressor_20240421_3_1000_0.5 XGBRegressor 3 1000 0.5 mean_squared_error 0.105632 0.091902 0.138890 0.064882 0.078063
32 0 DecisionTreeRegressor_20240421_3_1000_0.5 DecisionTreeRegressor 3 1000 0.5 mean_squared_error 0.241727 0.407739 0.943927 0.261217 0.184828
33 0 SVR_20240421_3_1000_0.5 SVR 3 1000 0.5 mean_squared_error 0.092054 0.059300 0.116067 0.048360 0.069405
34 0 ElasticNet_20240421_3_1000_0.5 ElasticNet 3 1000 0.5 mean_squared_error 1.051782 1.058547 2.432781 0.049848 0.071648
35 0 GAM_20240421_3_1000_0.5 GAM 3 1000 0.5 mean_squared_error 0.083994 0.056006 0.113014 0.047311 0.067380
36 0 RandomForestRegressor_20240421_4_1000_0.1 RandomForestRegressor 4 1000 0.1 mean_squared_error 0.180954 0.208171 0.946193 0.135369 0.152165
37 0 XGBRegressor_20240421_4_1000_0.1 XGBRegressor 4 1000 0.1 mean_squared_error 0.030425 0.033431 0.030240 0.017248 0.011525
38 0 DecisionTreeRegressor_20240421_4_1000_0.1 DecisionTreeRegressor 4 1000 0.1 mean_squared_error 0.204244 0.170217 1.001219 0.193011 0.105438
39 0 SVR_20240421_4_1000_0.1 SVR 4 1000 0.1 mean_squared_error 0.016255 0.003140 0.000335 0.005028 0.000161
40 0 ElasticNet_20240421_4_1000_0.1 ElasticNet 4 1000 0.1 mean_squared_error 0.982853 1.038563 2.296412 0.006162 0.000911
41 0 GAM_20240421_4_1000_0.1 GAM 4 1000 0.1 mean_squared_error 0.015841 0.003527 0.000282 0.005527 0.000175
42 0 RandomForestRegressor_20240421_4_1000_0.5 RandomForestRegressor 4 1000 0.5 mean_squared_error 0.174299 0.212195 0.975258 0.113023 0.161780
43 0 XGBRegressor_20240421_4_1000_0.5 XGBRegressor 4 1000 0.5 mean_squared_error 0.035058 0.040828 0.032653 0.021883 0.017901
44 0 DecisionTreeRegressor_20240421_4_1000_0.5 DecisionTreeRegressor 4 1000 0.5 mean_squared_error 0.191617 0.182884 0.866188 0.191907 0.117916
45 0 SVR_20240421_4_1000_0.5 SVR 4 1000 0.5 mean_squared_error 0.015504 0.007057 0.001739 0.008856 0.002945
46 0 ElasticNet_20240421_4_1000_0.5 ElasticNet 4 1000 0.5 mean_squared_error 0.987363 1.053859 2.296736 0.007183 0.002124
47 0 GAM_20240421_4_1000_0.5 GAM 4 1000 0.5 mean_squared_error 0.015893 0.006161 0.004430 0.006297 0.000623
48 0 RandomForestRegressor_20240421_5_1000_0.1 RandomForestRegressor 5 1000 0.1 mean_squared_error 0.259675 0.141057 0.738519 0.204913 0.150057
49 0 XGBRegressor_20240421_5_1000_0.1 XGBRegressor 5 1000 0.1 mean_squared_error 0.020060 0.014473 0.032888 0.020851 0.012283
50 0 DecisionTreeRegressor_20240421_5_1000_0.1 DecisionTreeRegressor 5 1000 0.1 mean_squared_error 0.133456 0.268438 0.926517 0.226013 0.142767
51 0 SVR_20240421_5_1000_0.1 SVR 5 1000 0.1 mean_squared_error 0.001317 0.001441 0.005647 0.000681 0.000705
52 0 ElasticNet_20240421_5_1000_0.1 ElasticNet 5 1000 0.1 mean_squared_error 0.997963 0.971906 2.301234 0.001179 0.007655
53 0 GAM_20240421_5_1000_0.1 GAM 5 1000 0.1 mean_squared_error 0.002960 0.001517 0.005619 0.000845 0.000446
54 0 RandomForestRegressor_20240421_5_1000_0.5 RandomForestRegressor 5 1000 0.5 mean_squared_error 0.273428 0.130831 0.671128 0.196193 0.152825
55 0 XGBRegressor_20240421_5_1000_0.5 XGBRegressor 5 1000 0.5 mean_squared_error 0.023066 0.012977 0.037802 0.026831 0.014318
56 0 DecisionTreeRegressor_20240421_5_1000_0.5 DecisionTreeRegressor 5 1000 0.5 mean_squared_error 0.148548 0.164499 0.767788 0.246336 0.111892
57 0 SVR_20240421_5_1000_0.5 SVR 5 1000 0.5 mean_squared_error 0.002123 0.001258 0.009890 0.002051 0.000985
58 0 ElasticNet_20240421_5_1000_0.5 ElasticNet 5 1000 0.5 mean_squared_error 0.996362 0.970059 2.305862 0.001735 0.007606
59 0 GAM_20240421_5_1000_0.5 GAM 5 1000 0.5 mean_squared_error 0.003468 0.000503 0.018058 0.002073 0.001189
In [28]:
%matplotlib inline
boxplot_feature_effect_results(features=["x_1", "x_2", "x_3", "x_4", "x_5"], df=df, effect_type="PDP");
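
The per-feature values in the table above are mean squared errors between two feature-effect curves. A minimal sketch of such a comparison, assuming both curves are evaluated on the same grid and all grid points are weighted equally (the notebook's exact computation is not shown); `curve_mse` and the two toy curves are hypothetical:

```python
import numpy as np

def curve_mse(effect_model, effect_groundtruth):
    """Mean squared error between two feature-effect curves on the same grid."""
    effect_model = np.asarray(effect_model, dtype=float)
    effect_groundtruth = np.asarray(effect_groundtruth, dtype=float)
    if effect_model.shape != effect_groundtruth.shape:
        raise ValueError("both curves must be evaluated on the same grid")
    return float(np.mean((effect_model - effect_groundtruth) ** 2))

grid = np.linspace(0.0, 1.0, 100)
pd_groundtruth = 10.0 * grid       # hypothetical groundtruth PD curve
pd_model = 10.0 * grid + 0.2       # hypothetical model PD curve, shifted by a constant
error = curve_mse(pd_model, pd_groundtruth)
```

In this toy case the two curves have identical shape, so the whole error comes from the constant vertical offset; this metric does not distinguish shape errors from shifts.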

PDP example visualizations

(simulation no. 1 with n_train=1000 and noise_sd=0.1)

In [35]:
%matplotlib inline
plot_effect_comparison(rf, groundtruth, X_train, effect="PDP", features=['x_1', "x_2", "x_3", "x_4", "x_5"], groundtruth_feature_effect="theoretical", config=config);
In [36]:
%matplotlib inline
plot_effect_comparison(xgb, groundtruth, X_train, effect="PDP", features=['x_1', "x_2", "x_3", "x_4", "x_5"], groundtruth_feature_effect="theoretical", config=config);
In [37]:
%matplotlib inline
plot_effect_comparison(tree, groundtruth, X_train, effect="PDP", features=['x_1', "x_2", "x_3", "x_4", "x_5"], groundtruth_feature_effect="theoretical", config=config);
In [38]:
%matplotlib inline
plot_effect_comparison(svm, groundtruth, X_train, effect="PDP", features=['x_1', "x_2", "x_3", "x_4", "x_5"], groundtruth_feature_effect="theoretical", config=config);
In [39]:
%matplotlib inline
plot_effect_comparison(elasticnet, groundtruth, X_train, effect="PDP", features=['x_1', "x_2", "x_3", "x_4", "x_5"], groundtruth_feature_effect="theoretical", config=config);
In [40]:
%matplotlib inline
plot_effect_comparison(gam, groundtruth, X_train, effect="PDP", features=['x_1', "x_2", "x_3", "x_4", "x_5"], groundtruth_feature_effect="theoretical", config=config);

Interesting observation:

There is a visible difference between the true (theoretical) partial dependence and the partial dependence estimated empirically on the groundtruth function.

In [41]:
# GAM PD vs. Theoretical Groundtruth PD
%matplotlib inline
plot_effect_comparison(gam, groundtruth, X_train, effect="PDP", features=['x_1', "x_2", "x_3", "x_4", "x_5"], groundtruth_feature_effect="theoretical", config=config);
In [42]:
# GAM PD vs. Empirical Groundtruth PD
%matplotlib inline
plot_effect_comparison(gam, groundtruth, X_train, effect="PDP", features=['x_1', "x_2", "x_3", "x_4", "x_5"], groundtruth_feature_effect="empirical", config=config);

ALE

The ALE results are compared to an empirically estimated groundtruth feature effect, i.e. the usual ALE estimator is applied directly to the groundtruth function.
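
A minimal sketch of such an empirical ALE estimate, written from the standard first-order ALE definition rather than from the notebook's code. The `groundtruth` function is a hypothetical stand-in, quantile-based interval edges are an assumption (the settings only state 99 intervals), and `ale_curve` is a name introduced here.

```python
import numpy as np

def ale_curve(predict, X, feature, n_intervals=99):
    """First-order ALE estimate for one feature, centered over the data."""
    x = X[:, feature]
    # Quantile-based interval edges (assumption); duplicates are dropped.
    edges = np.unique(np.quantile(x, np.linspace(0.0, 1.0, n_intervals + 1)))
    # Interval index k per observation, so that edges[k] <= x <= edges[k + 1].
    idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, len(edges) - 2)
    X_lo, X_hi = X.copy(), X.copy()
    X_lo[:, feature] = edges[idx]
    X_hi[:, feature] = edges[idx + 1]
    diffs = predict(X_hi) - predict(X_lo)          # local effect per observation
    counts = np.bincount(idx, minlength=len(edges) - 1)
    sums = np.bincount(idx, weights=diffs, minlength=len(edges) - 1)
    local = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
    ale = np.concatenate([[0.0], np.cumsum(local)])  # accumulated effect at each edge
    # Center: subtract the count-weighted mean of the interval midpoint values.
    mid = (ale[:-1] + ale[1:]) / 2
    ale -= np.sum(mid * counts) / counts.sum()
    return edges, ale

def groundtruth(X):
    # Hypothetical groundtruth function (an assumption, not taken from the notebook).
    return (10 * np.sin(np.pi * X[:, 0] * X[:, 1]) + 20 * (X[:, 2] - 0.5) ** 2
            + 10 * X[:, 3] + 5 * X[:, 4])

rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=(1000, 5))
edges, ale_x4 = ale_curve(groundtruth, X, feature=3)
```

For a feature that enters the function linearly, the estimated ALE curve is a straight line with the corresponding slope, centered around zero.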

Error of Model-ALE compared to groundtruth-ALE

In [44]:
effects_results_storage = config.get("storage", "effects_results")
df = pd.read_sql_table("ale_results", f"sqlite:///{base_path2 + effects_results_storage}")
df
Out[44]:
(row) index model_id model simulation n_train noise_sd metric x_1 x_2 x_3 x_4 x_5
0 0 RandomForestRegressor_20240413_1_1000_0.1 RandomForestRegressor 1 1000 0.1 mean_squared_error 0.102209 0.078024 0.504765 0.045459 0.022509
1 0 XGBRegressor_20240413_1_1000_0.1 XGBRegressor 1 1000 0.1 mean_squared_error 0.088313 0.032658 0.044790 0.033427 0.033133
2 0 DecisionTreeRegressor_20240413_1_1000_0.1 DecisionTreeRegressor 1 1000 0.1 mean_squared_error 0.450111 0.341412 0.613983 2.657984 0.412017
3 0 SVR_20240413_1_1000_0.1 SVR 1 1000 0.1 mean_squared_error 0.000037 0.000040 0.000218 0.000056 0.000214
4 0 ElasticNet_20240413_1_1000_0.1 ElasticNet 1 1000 0.1 mean_squared_error 0.856908 1.085617 2.223995 0.000952 0.001000
5 0 GAM_20240413_1_1000_0.1 GAM 1 1000 0.1 mean_squared_error 0.000334 0.000265 0.000089 0.000320 0.000190
6 0 RandomForestRegressor_20240413_1_1000_0.5 RandomForestRegressor 1 1000 0.5 mean_squared_error 0.107340 0.078632 0.481248 0.041882 0.039674
7 0 XGBRegressor_20240413_1_1000_0.5 XGBRegressor 1 1000 0.5 mean_squared_error 0.094882 0.087883 0.099508 0.028062 0.082043
8 0 DecisionTreeRegressor_20240413_1_1000_0.5 DecisionTreeRegressor 1 1000 0.5 mean_squared_error 0.332879 0.326019 0.512040 0.473676 0.305222
9 0 SVR_20240413_1_1000_0.5 SVR 1 1000 0.5 mean_squared_error 0.000317 0.001205 0.001456 0.000398 0.001531
10 0 ElasticNet_20240413_1_1000_0.5 ElasticNet 1 1000 0.5 mean_squared_error 0.856576 1.088063 2.224006 0.000951 0.000866
11 0 GAM_20240413_1_1000_0.5 GAM 1 1000 0.5 mean_squared_error 0.000803 0.001239 0.001782 0.000254 0.000386
12 0 RandomForestRegressor_20240413_2_1000_0.1 RandomForestRegressor 2 1000 0.1 mean_squared_error 0.108499 0.088505 0.553368 0.113833 0.017661
13 0 XGBRegressor_20240413_2_1000_0.1 XGBRegressor 2 1000 0.1 mean_squared_error 0.069223 0.066510 0.016483 0.031998 0.052082
14 0 DecisionTreeRegressor_20240413_2_1000_0.1 DecisionTreeRegressor 2 1000 0.1 mean_squared_error 0.758521 0.379145 0.758179 1.960838 0.104751
15 0 SVR_20240413_2_1000_0.1 SVR 2 1000 0.1 mean_squared_error 0.000037 0.000046 0.000136 0.000176 0.000047
16 0 ElasticNet_20240413_2_1000_0.1 ElasticNet 2 1000 0.1 mean_squared_error 0.944539 0.992477 2.259133 0.000714 0.000681
17 0 GAM_20240413_2_1000_0.1 GAM 2 1000 0.1 mean_squared_error 0.000256 0.000292 0.000179 0.000849 0.000138
18 0 RandomForestRegressor_20240413_2_1000_0.5 RandomForestRegressor 2 1000 0.5 mean_squared_error 0.087106 0.091529 0.559728 0.105091 0.022528
19 0 XGBRegressor_20240413_2_1000_0.5 XGBRegressor 2 1000 0.5 mean_squared_error 0.052993 0.062248 0.022821 0.032069 0.044360
20 0 DecisionTreeRegressor_20240413_2_1000_0.5 DecisionTreeRegressor 2 1000 0.5 mean_squared_error 0.657025 2.455691 0.773211 0.891441 0.239885
21 0 SVR_20240413_2_1000_0.5 SVR 2 1000 0.5 mean_squared_error 0.001267 0.001508 0.000990 0.001604 0.001187
22 0 ElasticNet_20240413_2_1000_0.5 ElasticNet 2 1000 0.5 mean_squared_error 0.943744 0.991152 2.261525 0.003615 0.000959
23 0 GAM_20240413_2_1000_0.5 GAM 2 1000 0.5 mean_squared_error 0.000709 0.001130 0.002092 0.003852 0.000032
24 0 RandomForestRegressor_20240413_3_1000_0.1 RandomForestRegressor 3 1000 0.1 mean_squared_error 0.084576 0.174632 0.687282 0.080984 0.044223
25 0 XGBRegressor_20240413_3_1000_0.1 XGBRegressor 3 1000 0.1 mean_squared_error 0.051716 0.009275 0.013223 0.032149 0.023333
26 0 DecisionTreeRegressor_20240413_3_1000_0.1 DecisionTreeRegressor 3 1000 0.1 mean_squared_error 0.456843 0.483704 0.601405 0.296938 0.217799
27 0 SVR_20240413_3_1000_0.1 SVR 3 1000 0.1 mean_squared_error 0.000176 0.000246 0.000082 0.000033 0.000023
28 0 ElasticNet_20240413_3_1000_0.1 ElasticNet 3 1000 0.1 mean_squared_error 1.028461 0.980727 2.227729 0.002826 0.007909
29 0 GAM_20240413_3_1000_0.1 GAM 3 1000 0.1 mean_squared_error 0.000408 0.000215 0.000224 0.000595 0.000148
30 0 RandomForestRegressor_20240413_3_1000_0.5 RandomForestRegressor 3 1000 0.5 mean_squared_error 0.097168 0.163768 0.701964 0.081795 0.051585
31 0 XGBRegressor_20240414_3_1000_0.5 XGBRegressor 3 1000 0.5 mean_squared_error 0.027236 0.026640 0.015414 0.031269 0.022480
32 0 DecisionTreeRegressor_20240414_3_1000_0.5 DecisionTreeRegressor 3 1000 0.5 mean_squared_error 0.301233 0.257206 1.141532 0.448084 0.425581
33 0 SVR_20240414_3_1000_0.5 SVR 3 1000 0.5 mean_squared_error 0.000716 0.003037 0.001180 0.000910 0.000937
34 0 ElasticNet_20240414_3_1000_0.5 ElasticNet 3 1000 0.5 mean_squared_error 1.022676 0.980959 2.227729 0.002991 0.007931
35 0 GAM_20240414_3_1000_0.5 GAM 3 1000 0.5 mean_squared_error 0.000905 0.001394 0.004704 0.000845 0.000172
36 0 RandomForestRegressor_20240414_4_1000_0.1 RandomForestRegressor 4 1000 0.1 mean_squared_error 0.067455 0.131734 0.796294 0.037615 0.065405
37 0 XGBRegressor_20240414_4_1000_0.1 XGBRegressor 4 1000 0.1 mean_squared_error 0.058053 0.071941 0.015962 0.037490 0.089937
38 0 DecisionTreeRegressor_20240414_4_1000_0.1 DecisionTreeRegressor 4 1000 0.1 mean_squared_error 0.251189 0.250814 0.793227 0.414049 1.172958
39 0 SVR_20240414_4_1000_0.1 SVR 4 1000 0.1 mean_squared_error 0.000075 0.000123 0.000288 0.000091 0.000137
40 0 ElasticNet_20240414_4_1000_0.1 ElasticNet 4 1000 0.1 mean_squared_error 1.225967 0.968002 2.184606 0.001146 0.000868
41 0 GAM_20240414_4_1000_0.1 GAM 4 1000 0.1 mean_squared_error 0.000220 0.000426 0.000248 0.000495 0.000164
42 0 RandomForestRegressor_20240414_4_1000_0.5 RandomForestRegressor 4 1000 0.5 mean_squared_error 0.060786 0.146011 0.788864 0.036007 0.079909
43 0 XGBRegressor_20240414_4_1000_0.5 XGBRegressor 4 1000 0.5 mean_squared_error 0.060250 0.068185 0.041078 0.048566 0.063810
44 0 DecisionTreeRegressor_20240414_4_1000_0.5 DecisionTreeRegressor 4 1000 0.5 mean_squared_error 0.795661 0.524065 0.511875 0.671951 0.346964
45 0 SVR_20240414_4_1000_0.5 SVR 4 1000 0.5 mean_squared_error 0.000692 0.002709 0.001586 0.002086 0.002741
46 0 ElasticNet_20240414_4_1000_0.5 ElasticNet 4 1000 0.5 mean_squared_error 1.230699 0.978949 2.184610 0.001238 0.001969
47 0 GAM_20240414_4_1000_0.5 GAM 4 1000 0.5 mean_squared_error 0.000418 0.002201 0.004158 0.000320 0.000533
48 0 RandomForestRegressor_20240414_5_1000_0.1 RandomForestRegressor 5 1000 0.1 mean_squared_error 0.082736 0.110605 0.610456 0.152503 0.071705
49 0 XGBRegressor_20240414_5_1000_0.1 XGBRegressor 5 1000 0.1 mean_squared_error 0.137153 0.118048 0.031283 0.070292 0.065766
50 0 DecisionTreeRegressor_20240414_5_1000_0.1 DecisionTreeRegressor 5 1000 0.1 mean_squared_error 0.452512 0.360682 1.111724 1.219658 0.374757
51 0 SVR_20240414_5_1000_0.1 SVR 5 1000 0.1 mean_squared_error 0.000037 0.000121 0.000236 0.000031 0.000208
52 0 ElasticNet_20240414_5_1000_0.1 ElasticNet 5 1000 0.1 mean_squared_error 0.961997 1.045239 2.380695 0.000765 0.007340
53 0 GAM_20240414_5_1000_0.1 GAM 5 1000 0.1 mean_squared_error 0.000359 0.000281 0.000510 0.000422 0.000125
54 0 RandomForestRegressor_20240414_5_1000_0.5 RandomForestRegressor 5 1000 0.5 mean_squared_error 0.086741 0.107020 0.513056 0.141217 0.042970
55 0 XGBRegressor_20240414_5_1000_0.5 XGBRegressor 5 1000 0.5 mean_squared_error 0.055380 0.080611 0.047591 0.095212 0.016603
56 0 DecisionTreeRegressor_20240414_5_1000_0.5 DecisionTreeRegressor 5 1000 0.5 mean_squared_error 0.598115 0.313542 0.728063 4.416162 0.187126
57 0 SVR_20240414_5_1000_0.5 SVR 5 1000 0.5 mean_squared_error 0.000359 0.002813 0.004640 0.001000 0.000200
58 0 ElasticNet_20240414_5_1000_0.5 ElasticNet 5 1000 0.5 mean_squared_error 0.960468 1.047228 2.386402 0.000374 0.006440
59 0 GAM_20240414_5_1000_0.5 GAM 5 1000 0.5 mean_squared_error 0.001030 0.001030 0.009240 0.000720 0.000024
In [46]:
%matplotlib inline
boxplot_feature_effect_results(features=["x_1", "x_2", "x_3", "x_4", "x_5"], df=df, effect_type="ALE");

ALE example visualizations

(simulation no. 1 with n_train=1000 and noise_sd=0.1)

In [49]:
%matplotlib inline
plot_effect_comparison(rf, groundtruth, X_train, effect="ALE", features=['x_1', "x_2", "x_3", "x_4", "x_5"], groundtruth_feature_effect="empirical", config=config);
In [50]:
%matplotlib inline
plot_effect_comparison(xgb, groundtruth, X_train, effect="ALE", features=['x_1', "x_2", "x_3", "x_4", "x_5"], groundtruth_feature_effect="empirical", config=config);
In [51]:
%matplotlib inline
plot_effect_comparison(tree, groundtruth, X_train, effect="ALE", features=['x_1', "x_2", "x_3", "x_4", "x_5"], groundtruth_feature_effect="empirical", config=config);
In [52]:
%matplotlib inline
plot_effect_comparison(svm, groundtruth, X_train, effect="ALE", features=['x_1', "x_2", "x_3", "x_4", "x_5"], groundtruth_feature_effect="empirical", config=config);
In [53]:
%matplotlib inline
plot_effect_comparison(elasticnet, groundtruth, X_train, effect="ALE", features=['x_1', "x_2", "x_3", "x_4", "x_5"], groundtruth_feature_effect="empirical", config=config);
In [54]:
%matplotlib inline
plot_effect_comparison(gam, groundtruth, X_train, effect="ALE", features=['x_1', "x_2", "x_3", "x_4", "x_5"], groundtruth_feature_effect="empirical", config=config);

Open Questions

  • number of simulation runs? (currently: 5; for more: server access?)
  • size of training dataset? try several sizes? (currently: 1000)
  • sensible values for noise param? (currently: 0.1, 0.5)
  • how many grid points? how to choose grid? (currently: 100, equidistant)
  • how to compute true/theoretical groundtruth feature effect for ALE?
  • reason behind the difference between the empirical and the theoretical groundtruth feature effect? (possibly: the sample mean of X deviates slightly from the theoretical expected value, which shifts the curves vertically; the shift is amplified when the feature enters with a factor, e.g. 10 for $x_4$)
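
The conjecture in the last open question can be illustrated with a small sketch. It assumes a hypothetical additive groundtruth with a term 10 * x_4 and features uniform on [0, 1]; neither is confirmed by the notebook. Under these assumptions the empirical PD of x_1 differs from the theoretical one by a constant equal to 10 times the deviation of the sample mean of x_4 from 0.5.

```python
import numpy as np

def groundtruth(X):
    # Hypothetical additive groundtruth: f(x) = x_1^2 + 10 * x_4 (columns 0 and 3).
    return X[:, 0] ** 2 + 10.0 * X[:, 3]

rng = np.random.default_rng(3)
X = rng.uniform(0.0, 1.0, size=(1000, 5))   # assumed: features uniform on [0, 1]
grid = np.linspace(0.0, 1.0, 100)

# Theoretical PD of x_1: integrate x_4 out using its expected value E[x_4] = 0.5.
pd_theoretical = grid ** 2 + 10.0 * 0.5

# Empirical PD of x_1: average the groundtruth over the observed sample.
pd_empirical = np.empty_like(grid)
for i, value in enumerate(grid):
    X_mod = X.copy()
    X_mod[:, 0] = value
    pd_empirical[i] = groundtruth(X_mod).mean()

shift = pd_empirical - pd_theoretical            # constant along the grid
expected_shift = 10.0 * (X[:, 3].mean() - 0.5)   # factor 10 amplifies the deviation
```

Under this reading, the square of the shift would enter the MSE between the two curves at every grid point, independent of how well the curve shapes agree.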

Further Ideas

  • different distributions for X to explore PDs/ALEs with less data in some areas
  • use data with correlated features
  • use other datasets (more real-world like datasets)
  • warmstart tuning for better efficiency
  • 2nd order effects / 2D feature effects
  • analyse how the model error relates to the error of the PDPs/ALEs (correlation?)
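
As a starting point for the last idea, a rank correlation between model test error and PDP error can be computed directly from the result tables. The sketch below uses only the six models of simulation 1 with noise_sd = 0.1 and the PDP error of x_1, so it is an illustration rather than a result; the `ranks` helper is introduced here and assumes there are no ties.

```python
import numpy as np

models = ["RandomForestRegressor", "XGBRegressor", "DecisionTreeRegressor",
          "SVR", "ElasticNet", "GAM"]
# Values copied from the tables above (simulation 1, noise_sd = 0.1).
mse_test = np.array([2.564450, 0.293658, 5.658174, 0.018337, 6.326010, 0.014902])
pdp_mse_x1 = np.array([0.180076, 0.033126, 0.196085, 0.015542, 0.972314, 0.016391])

def ranks(values):
    # Rank 0 = smallest value; valid as long as there are no ties.
    return np.argsort(np.argsort(values))

# Spearman rank correlation = Pearson correlation of the ranks.
rho = np.corrcoef(ranks(mse_test), ranks(pdp_mse_x1))[0, 1]
print(f"Spearman rho between test MSE and PDP MSE (x_1): {rho:.3f}")
```

A proper analysis would pool all simulations, noise levels and features, and would repeat the comparison for the ALE errors.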